Abstract. Regular expression patterns are a key feature of document 
processing languages like Perl and XDuce. It is in this context that the 
first and longest match policies have been proposed to disambiguate the 
pattern matching process. We formally define a matching semantics with 
these policies and show that the generally accepted method of simulating 
longest match by first match and recursion is incorrect. We continue by 
solving the associated type inference problem, which consists in calculat- 
ing for every subexpression the set of words the subexpression can still 
match when these policies are in effect, and show how this algorithm can 
be used to efficiently implement the matching process. 
1 Introduction 



Using regular (tree) expression patterns to extract relevant data from a string 
(or tree) is a highly desirable feature for programming languages supporting 
document transformation or data retrieval. Indeed, it is a core feature of Perl 
[1] and has recently been proposed in the context of the XML programming 
language XDuce [2,3]. The matching process consists of two parts: (1) ensuring 
that the input belongs to the language of the expression; and, (2) associating 
with every subexpression the matching part of the input. In general, patterns 
can be ambiguous, meaning that there are various ways of matching the input, 
resulting in multiple associations. When regular expression patterns are used as 
database queries, it is indeed common and desirable for a pattern to have many 
matches in the data, and to be able to retrieve all of them. However, in general- 
purpose programming using pattern matching as in ML, Prolog, or XDuce, we 
normally want unique matching and a deterministic semantics. 

One approach to the latter problem would be to simply disallow ambigu- 
ity by requiring the regular expressions to be unambiguous [4]. Another, more 
programmer-friendly approach is to allow arbitrary regular expression patterns, 
but to employ a special unique matching semantics. In this paper, we investigate 
this last approach on strings. We present a formal definition of unique matching, 
and give a sound and complete algorithm for solving the associated regular type 
inference problem: given a regular expression P and a regular "context language" 
C, compute for each subexpression P' of P the regular language consisting of all 
subwords w′ of an input string w ∈ C, such that w′ is matched by P′ when 
matching w uniquely to P. 

Regular type inference is useful for type-checking transformations: given an 
input language, does the transformed document always adhere to a desired out- 
put language [5-8]? An important feature of our approach is that it directly 
yields an unambiguous NFA that not only contains the types of all the subex- 
pressions of the given pattern, but also serves to perform the actual matching 
on any given string in linear time. 

Regular expression pattern matching and its type inference problem were first 
studied in the context of XDuce, an XML processing language [2]. While enor- 
mously influential, it suffers from a few disadvantages. First, the XDuce type 
inference algorithm is incomplete. Also, the formalism used to represent regular 
tree languages (in terms of linear context-free grammars with encoding into bi- 
nary trees) is hardwired into the algorithm, making it very syntactic in nature 
and hard to understand. Our aim is to abstract away from a particular syntax 
of regular languages. We therefore present a sound and complete algorithm us- 
ing only operations on languages. As such, our algorithm is independent of a 
surrounding (regular expression) type system. 

Another problem with the XDuce approach is the introduction of a miscon- 
ception regarding unique matching of regular expression patterns. Namely, the 
longest match policy used to disambiguate the Kleene closure is simulated by 
recursion and the first match policy, which is used to disambiguate disjunctions. 
We will show that this simulation is incorrect. 



Two recent follow-ups on XDuce, which happened concurrently and independently
of our own work, are CDuce and λʳᵉ [6, 9, 10]. While both approaches 
claim complete type inference, they follow XDuce in simulating longest match 
by first match and recursion. We will show that this causes the inferred types to 
be incorrect with regard to the longest match policy. Another advantage of our 
approach is the elementary nature of our type inference method, which works 
purely on the language level and which lends itself to a reasonably simple correctness 
proof. 

The rest of this paper is organized as follows. In Sect. 2 we formally define 
the matching relation based on two disambiguating rules: first match and longest 
match. It is shown that the above-mentioned simulation of longest match by 
first match and recursion is incorrect. In Sect. 3 we introduce the type inference 
problem and give a declarative way of solving it. A concrete implementation 
strategy is given in Sect. 4, where we also show that this strategy leads to an 
efficient implementation of the matching process. The last section touches on 
some future work. 

2 Unique Pattern Matching 

Matching against regular expression patterns is done in two parts: (1) making 
sure that the input word belongs to the language of the expression; and, (2) as- 
sociating with every subexpression the matching subword. These associations 
can then be used to extract relevant data from the matched string. However, 
regular expressions can be ambiguous [4]. For instance, when we want to show 
that aa is matched by a* · (a + ε), should we associate aa to a* and λ to (a + ε)
or should we associate a to a* and a to (a + ε)?¹ And what if we consider a
being matched by a + a, should we associate a with the first a or the second?
Furthermore, how should we deal with the matching of aab by (a + b + a · b)*? In
order to get a unique matching strategy, we define the *-operator to be "greedy",
meaning it should match the longest possible subword still allowing the rest of
the pattern to match. This is referred to as the longest match semantics in Perl,
XDuce, CDuce and λʳᵉ. Furthermore, for patterns like P₁ + P₂, we opt for the
first match policy where we only associate the subword with P₂ if it cannot be
matched by P₁. Finally, we treat P* as being atomic, in the sense that we do not 
give associations for subexpressions of P. In this section, we give formal defini- 
tions of patterns and the matching process and show that the proposed policies 
guarantee a unique matching strategy. 

We assume to be given a fixed, finite alphabet Σ which does not contain
the special symbols ⊥ and □. Elements of Σ will be denoted with σ and words
over Σ will be denoted with w throughout the rest of this paper. A regular
expression pattern P is a regular expression over Σ. That is, P is either of the
form ε, σ with σ ∈ Σ, P₁ + P₂, P₁ · P₂ or P₁*, where P₁ and P₂ are already regular
expression patterns. We define all operators to be right-associative. The set of 

¹ To avoid confusion we denote the empty word with λ and the regular expression
recognizing λ with ε. 
Fig. 1. Left: The syntax-tree representation of P = (a + a*) · a* · (a + ε). The bindable
nodes have their addresses annotated. Right: The associations resulting from the
matching of P against aaaa. Nodes that are not mentioned are associated with ⊥. 



all patterns is denoted by 𝒫. Because we will consider the abstract syntax tree
of a pattern, we abuse notation slightly and identify P with the partial function

P : {1, 2}* → {*, ·, +, ε} ∪ Σ such that

- if P = ε then dom(P) = {λ} and P(λ) = ε,

- if P = σ with σ ∈ Σ then dom(P) = {λ} and P(λ) = σ,

- if P = P₁ + P₂ then dom(P) = {λ} ∪ {1n | n ∈ dom(P₁)} ∪ {2n | n ∈ dom(P₂)}
with P(λ) = +, P(1n) = P₁(n) and P(2n) = P₂(n),

- if P = P₁ · P₂ we make a similar definition, only P(λ) = ·, and

- if P = P₁* then dom(P) = {λ} ∪ {1n | n ∈ dom(P₁)}, P(λ) = * and P(1n) =
P₁(n). 

Intuitively the function view of a pattern describes the abstract syntax tree of 
its regular expression, as shown in Fig. 1. Elements of {1,2}* are called nodes 
and will be denoted by n, m and their subscripted versions. We say that node
n is an ancestor of m if there is some n′ ≠ λ for which m = nn′. If m = n1
(m = n2) then m is the left (right) child of n. A node n ∈ dom(P) is a bindable
node of P if it does not have an ancestor labeled with *. The set of bindable 
nodes of P is denoted with bn(P). We define the size |P| of P as the cardinality 
of its domain. 
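
As an illustration of the function view, the following Python sketch builds the address table of a pattern and computes its bindable nodes. The tuple encoding of patterns and the helper names `as_function` and `bindable_nodes` are assumptions made for this example only; node addresses are strings over {1, 2}, with the empty string standing for λ.

```python
# Patterns as nested tuples (an encoding assumed for this sketch):
# ("eps",), ("sym", "a"), ("alt", p, q), ("cat", p, q), ("star", p)

def as_function(p):
    """Return the partial function view of p: node address -> label."""
    kind = p[0]
    if kind == "eps":
        return {"": "ε"}
    if kind == "sym":
        return {"": p[1]}
    table = {"": {"alt": "+", "cat": "·", "star": "*"}[kind]}
    for index, child in enumerate(p[1:], start=1):
        for node, label in as_function(child).items():
            table[str(index) + node] = label
    return table

def bindable_nodes(table):
    """Nodes none of whose proper ancestors is labeled with '*'."""
    return {n for n in table
            if not any(table[n[:k]] == "*" for k in range(len(n)))}

a, eps = ("sym", "a"), ("eps",)
# P = (a + a*) · a* · (a + ε), with right-associative concatenation
P = ("cat", ("alt", a, ("star", a)), ("cat", ("star", a), ("alt", a, eps)))
TABLE = as_function(P)
print(len(TABLE))                      # the size |P|
print(sorted(bindable_nodes(TABLE)))   # addresses of the bindable nodes
```

The two children of the starred subpatterns (addresses 121 and 211) are the only nodes excluded from the bindable set, which is how the sketch reflects that P* is treated as atomic.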

The matching process is formally described by the matching relation w ∈ P ⇝ V,
signifying that w is matched by P yielding associations V. Here, V is a function
from bn(P) to subwords of w or to the special symbol ⊥. Intuitively, V(n) = w′
if the pattern rooted at node n is responsible for matching the subword w′. It
is ⊥ if the subpattern is not responsible for recognizing any subword of w. To
simplify the definition of the matching relation we introduce some notation. If
V₁ and V₂ are such functions, then we use V₁ + P₂ to denote the function for
which (V₁ + P₂)(λ) = V₁(λ), (V₁ + P₂)(1n) = V₁(n) for every n ∈ dom(V₁) and
(V₁ + P₂)(2n) = ⊥ for every n ∈ bn(P₂). We define P₁ + V₂ similarly. Moreover,
V₁ · V₂ is the function such that (V₁ · V₂)(λ) = V₁(λ) · V₂(λ) if V₁(λ) ≠ ⊥ and
V₂(λ) ≠ ⊥, and it is ⊥ otherwise. Furthermore, (V₁ · V₂)(1n) = V₁(n) for every
n ∈ dom(V₁) and (V₁ · V₂)(2n) = V₂(n) for every n ∈ dom(V₂). 

The inference rules for w ∈ P ⇝ V are given in Fig. 2. We write w ∈ P
if w ∈ P ⇝ V holds for some V and w ∉ P otherwise. The auxiliary relation
(w₁, w₂) ∈ P₁ · P₂ ⇝ (V₁, V₂) is used to indicate that when matching w₁w₂ by
P₁ · P₂, pattern P₁ is responsible for matching w₁, yielding associations V₁, while P₂ 
Fig. 2. The matching relation 



is responsible for matching w₂, yielding associations V₂. The first match policy is
implemented in rules Or2 and COr2, where we do not allow the second branch of a
disjunction to be examined until the first one fails to match. The longest match
policy is expressed in rule CKleene. Figure 1 shows the associations obtained by
matching (a + a*) · a* · (a + ε) against aaaa. 
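
The two policies can be prototyped as a small reference matcher. The sketch below is a reconstruction from the prose only, since the rules of Fig. 2 are not reproduced in this text: it assumes that (A + B) · P is handled like A · P + B · P, that (A · B) · P is handled like A · (B · P), and that a starred left operand takes the longest prefix that still lets the rest match. The tuple encoding, the helper names, and the use of Python's `re.fullmatch` as a plain membership test for L(P) are all choices made for this illustration; `None` plays the role of ⊥.

```python
import re

# Patterns: ("eps",), ("sym", "a"), ("alt", p, q), ("cat", p, q), ("star", p)

def to_re(p):
    """Regex syntax for p, used only to decide membership in L(p)."""
    kind = p[0]
    if kind == "eps":
        return "(?:)"
    if kind == "sym":
        return re.escape(p[1])
    if kind == "alt":
        return "(?:%s|%s)" % (to_re(p[1]), to_re(p[2]))
    if kind == "cat":
        return "(?:%s%s)" % (to_re(p[1]), to_re(p[2]))
    return "(?:%s)*" % to_re(p[1])

def member(w, p):
    return re.fullmatch(to_re(p), w) is not None

def bindable(p):
    if p[0] in ("eps", "sym", "star"):
        return [""]
    return [""] + ["1" + n for n in bindable(p[1])] + ["2" + n for n in bindable(p[2])]

def bottoms(p):
    return {n: None for n in bindable(p)}          # every node mapped to ⊥

def graft(v1, v2, root):
    """Associations of a binary node from those of its two children."""
    out = {"": root}
    out.update({"1" + n: s for n, s in v1.items()})
    out.update({"2" + n: s for n, s in v2.items()})
    return out

def match(p, w):
    """Associations for w matched by p, or None if w is not in L(p)."""
    kind = p[0]
    if kind in ("eps", "sym", "star"):
        return {"": w} if member(w, p) else None
    if kind == "alt":
        left = match(p[1], w)                      # first match: left branch first
        if left is not None:
            return graft(left, bottoms(p[2]), w)
        right = match(p[2], w)
        return None if right is None else graft(bottoms(p[1]), right, w)
    split = cmatch(p[1], p[2], w)
    return None if split is None else graft(split[0], split[1], w)

def cmatch(p1, p2, w):
    """Split w for the concatenation p1 · p2; returns (V1, V2) or None."""
    kind = p1[0]
    if kind == "alt":                              # (A + B) · p2
        res = cmatch(p1[1], p2, w)
        if res is not None:
            return graft(res[0], bottoms(p1[2]), res[0][""]), res[1]
        res = cmatch(p1[2], p2, w)
        if res is None:
            return None
        return graft(bottoms(p1[1]), res[0], res[0][""]), res[1]
    if kind == "cat":                              # (A · B) · p2 as A · (B · p2)
        res = cmatch(p1[1], ("cat", p1[2], p2), w)
        if res is None:
            return None
        va, rest = res
        vb = {n[1:]: s for n, s in rest.items() if n.startswith("1")}
        v2 = {n[1:]: s for n, s in rest.items() if n.startswith("2")}
        return graft(va, vb, va[""] + vb[""]), v2
    for k in range(len(w), -1, -1):                # ε, σ or Q*: longest prefix first
        if member(w[:k], p1):
            v2 = match(p2, w[k:])
            if v2 is not None:
                return {"": w[:k]}, v2
    return None

a, b, eps = ("sym", "a"), ("sym", "b"), ("eps",)
FIG1 = ("cat", ("alt", a, ("star", a)), ("cat", ("star", a), ("alt", a, eps)))
print(match(FIG1, "aaaa"))
```

This brute-force matcher re-scans the input for every candidate split, so it only serves as an executable reading of the policies, not as an efficient implementation.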

The language L(P) of a pattern P is defined as the language of its regular 
expression. We can then obtain the following theorem: 

Theorem 1. The matching relation of Fig. 2 is well defined: 

1. The matching relation is semantically correct: w ∈ P ⇝ V iff w ∈ L(P), and,

2. The matching relation is unique: if w ∈ P ⇝ V and w ∈ P ⇝ W then
V = W. 

Proof (Sketch). The "if" part of (1) can be proved by induction on the matching
derivation. To prove the other way around, we first define the relation ⊐ ⊆ 𝒫 × 𝒫
where ⊐ relates a pattern with its immediate sub-patterns if P ≠ (P₁ · P₂) · P₃ and
P ≠ (P₁ + P₂) · P₃. We define ⊐ to relate (P₁ · P₂) · P₃ with P₂ · P₃ and with P₁ · (P₂ · P₃)
and to relate (P₁ + P₂) · P₃ with P₁ · P₃ and with P₂ · P₃. The monotone embedding
φ into ℕ × ℕ where φ(P) = (|P|, 0) if P ≠ P₁ · P₂ and φ(P₁ · P₂) = (|P₁ · P₂|, |P₁|)
otherwise, shows that ⊐ is a well-founded ordering on 𝒫. The proof then goes
by well-founded induction on ⊐. Statement (2) is proved by induction on
both matching derivations. □ 

In related work [2, 3, 6, 9], the longest match policy is simulated by the first
match policy and recursion. We will now show that this simulation is incorrect.
Concretely, consider the inference rule CKleene′ for Kleene closure in a
concatenation from λʳᵉ (XDuce and CDuce use equivalent rules). 

(w₁, w₂) ∈ ((P₁ · P₁*) + ε) · P₂ ⇝ (V₁, V₂)
──────────────────────────────────────────── CKleene′
(w₁, w₂) ∈ P₁* · P₂ ⇝ (V₁, V₂) 

Here, we assume w.l.o.g. λ ∉ P₁. The proposed intuition behind this rule is that,
when trying to derive w ∈ P₁* · P₂ ⇝ V, we will be forced by the first match
policy to consider (P₁ · P₁*) · P₂ before ε · P₂ at every expansion of P₁* · P₂. Since
λ ∉ P₁, this should require us to split w into w₁ ∈ P₁* and w₂ ∈ P₂ such that w₂
is the smallest suffix of w still matched by P₂. However, this is a false intuition.
Indeed, because the first match strategy continues to be used in P₁, it is possible
that P₂ is allowed to start matching before a longer matching alternative in P₁ is
considered. For example, consider the matching of ab against P = (a + a · b)* · (b + ε).
By the longest match policy, we would expect (a + a · b)* to be associated with
ab. Indeed, this is the unique association derived by our rules of Fig. 2. Rule
CKleene′, however, will incorrectly derive the association of a to (a + a · b)* and
b to (b + ε), as we show next. 



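(a, b) ∈ (a + a · b) · ((a + a · b)* · (b + ε)) ⇝ (V₁′, W)
(λ, b) ∈ (a + a · b)* · (b + ε) ⇝ (V₁″, V₂)
────────────────────────────────────────────────────────────
(a, b) ∈ ((a + a · b) · (a + a · b)*) · (b + ε) ⇝ (V₁ = V₁′ · V₁″, V₂)
──────────────────────────────────────────────────────────── COr1
(a, b) ∈ (((a + a · b) · (a + a · b)*) + ε) · (b + ε) ⇝ (V₁, V₂)
──────────────────────────────────────────────────────────── CKleene′
(a, b) ∈ (a + a · b)* · (b + ε) ⇝ (V₁, V₂)
──────────────────────────────────────────────────────────── CElem
ab ∈ P ⇝ V₁ · V₂ 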

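The derivation for the first subgoal (a, b) ∈ (a + a · b) · ((a + a · b)* · (b + ε)) ⇝
(V₁′, W) must look like

                                       ⋮
                                       ──────────────────────────── CElem
a ∈ a ⇝ V₁‴ = [λ ↦ a]  (Lab)          b ∈ (a + a · b)* · (b + ε) ⇝ W
──────────────────────────────────────────────────────────── CLab
(a, b) ∈ a · ((a + a · b)* · (b + ε)) ⇝ (V₁‴, W)
──────────────────────────────────────────────────────────── COr1
(a, b) ∈ (a + a · b) · ((a + a · b)* · (b + ε)) ⇝ (V₁′ = V₁‴ + (a · b), W) 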

Here, the subderivation indicated by the dots above the use of CElem is
isomorphic to the derivation for the second subgoal (λ, b) ∈ (a + a · b)* · (b + ε) ⇝
(V₁″, V₂):

b ∈ b ⇝ V₂ = [λ ↦ b]
──────────────────── Or1
b ∈ (b + ε) ⇝ V₂
──────────────────────────────────────── CEmpty
(λ, b) ∈ ε · (b + ε) ⇝ (V₁″ = [λ ↦ λ], V₂)      b ∉ ((a + a · b) · (a + a · b)*) · (b + ε)
──────────────────────────────────────────────────────────── COr2
(λ, b) ∈ (((a + a · b) · (a + a · b)*) + ε) · (b + ε) ⇝ (V₁″, V₂)
──────────────────────────────────────────────────────────── CKleene′
(λ, b) ∈ (a + a · b)* · (b + ε) ⇝ (V₁″, V₂)

Here Lab derives the topmost judgment b ∈ b. Now (V₁ · V₂)(1) = a and
(V₁ · V₂)(2) = b, as we wanted to show. 
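
The counterexample can be reproduced with a few lines of Python. The sketch below is only an illustration under stated assumptions: `longest_match_split` is a hypothetical brute-force helper that encodes the longest match policy directly, and Python's backtracking `re` engine is used as a stand-in for the first-match-plus-recursion simulation, because it tries alternatives left to right and tries another iteration of a greedy star before leaving it.

```python
import re

STAR = r"(?:a|ab)*"   # (a + a·b)*
REST = r"(?:b|)"      # (b + ε)

def longest_match_split(w):
    """Longest prefix in L(STAR) whose remaining suffix is in L(REST)."""
    for k in range(len(w), -1, -1):
        if re.fullmatch(STAR, w[:k]) and re.fullmatch(REST, w[k:]):
            return w[:k], w[k:]
    return None

def backtracking_split(w):
    """Split chosen by a left-to-right backtracking engine."""
    m = re.fullmatch("(%s)(%s)" % (STAR, REST), w)
    return m.groups() if m else None

print(longest_match_split("ab"))   # ('ab', '')
print(backtracking_split("ab"))    # ('a', 'b')
```

On the input ab the backtracking engine commits to the alternative a inside the star, fails to start another iteration, and lets (b + ε) consume b, which is exactly the split derived with CKleene′ above.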



3 Type Inference 



The matching process described in the previous section is used in many practi- 
cal languages (including Perl) where the associations are used to construct the 
output. Recently, there has been growing interest in adding type safety to such
languages: given an input language, the transformation should always produce
outputs adhering to a certain output language [5-7]. In one approach to achieving
this, one has to infer for every subexpression in the pattern the set of words 
it is capable of matching. In this section we present an algorithm for this type 
inference problem. We also note that the existing type inference algorithms are 
incorrect with regard to the longest match policy; this will follow from the in- 
correct simulation of longest match by recursion and first match, as shown in 
the previous section. 

Let C be a set of words called the context. The type of a bindable node n
in P relative to C, denoted as T(n, P, C), is the set of words w for which there
exists some w′ ∈ C such that w′ ∈ P ⇝ V and V(n) = w. The main result of
this paper can be stated as follows: 

Theorem 2. If C is a regular language then T(n, P, C) is also regular, and can
be effectively computed. 

The algorithm is obtained by structural induction on the pattern, applying
the equalities we introduce in the following lemmas and propositions.² For
instance, the next lemma shows how to calculate T(λ, P, C):

Lemma 1. T(λ, P, C) = L(P) ∩ C for any pattern P.

Proof. By a simple induction on the matching derivation we can prove that if
w ∈ P ⇝ V then V(λ) = w and if (w₁, w₂) ∈ P₁ · P₂ ⇝ (V₁, V₂) then V₁(λ) = w₁
and V₂(λ) = w₂. Combining this observation with Theorem 1 gives the desired
result almost immediately. □ 

Note that this result completely solves the type inference problem when P equals
ε, σ or P₁*, since bn(P) = {λ} in these cases. When P is of a different form, we will
calculate T(n, P, C) from T(n′, P′, C′) for some simpler pattern P′ and possibly
different context C′. In the case where P = (P₁ · P₂) · P₃ we will need the set
M(n, P, C) = {w₁□w₂ | ∃w′ ∈ C, w′ ∈ P ⇝ V, V(n1) = w₁, V(n2) = w₂} to be
defined on all bindable nodes of P which are labeled with a concatenation. We
will also show how to calculate this set. Intuitively, the symbol □ specifies how
w₁w₂ is "broken up" into subwords when it is matched by the concatenation at
node n. 

The case where P = P₁ + P₂ is handled by the following proposition:

Proposition 1. For P = P₁ + P₂, the following equalities hold:

1. T(λ, P, C) = T(λ, P₁, C) ∪ T(λ, P₂, C − L(P₁))

2. T(1n, P, C) = T(n, P₁, C)

3. T(2n, P, C) = T(n, P₂, C − L(P₁))

4. M(1n, P, C) = M(n, P₁, C)

5. M(2n, P, C) = M(n, P₂, C − L(P₁))

² Proofs of the claims in this section are given in Appendix A. 
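
For a finite context, Lemma 1 and the first and third equalities of Proposition 1 can be checked by direct enumeration. The following sketch is an illustration only: the helper `restrict`, the concrete patterns P1 = a + a·b and P2 = a·b + b, and the context C are assumptions chosen to show that a word of L(P2) that also lies in L(P1) never reaches the second branch.

```python
import re

def restrict(regex, words):
    """L(regex) ∩ words, for a finite set of words."""
    return {w for w in words if re.fullmatch(regex, w)}

P1, P2 = r"a|ab", r"ab|b"
C = {"a", "ab", "b", "ba"}

t_p1 = restrict(P1, C)            # T(λ, P1, C) = L(P1) ∩ C, by Lemma 1
t_p2 = restrict(P2, C - t_p1)     # T(λ, P2, C − L(P1)), which is also T(2, P, C)
t_root = t_p1 | t_p2              # T(λ, P, C), by Proposition 1, item 1

print(sorted(t_p1), sorted(t_p2), sorted(t_root))
```

Although ab belongs to L(P2), it is absent from the type of the second branch because the first match policy assigns it to P1.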

The next two propositions handle P = ε · P₂ and P = σ · P₂. Here, the left quotient
of language L by language K, defined as {s | ∃p ∈ K : ps ∈ L}, is denoted as
K\L. The right quotient of L by K, defined as {p | ∃s ∈ K : ps ∈ L}, is denoted
as L/K. It is well-known that regular languages are closed under both quotients
[11]. 
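
On finite languages both quotients can be computed directly from their definitions. The sketch below uses hypothetical helper names and a small example context; it only illustrates the set definitions and says nothing about how quotients are computed on automata.

```python
def left_quotient(K, L):
    """K\\L = {s | there is p in K with p + s in L}."""
    return {w[len(p):] for w in L for p in K if w.startswith(p)}

def right_quotient(L, K):
    """L/K = {p | there is s in K with p + s in L}."""
    return {w[:len(w) - len(s)] for w in L for s in K if w.endswith(s)}

C = {"ab", "abb", "bb"}
L_P2 = {"b", "bb"}                     # a finite stand-in for L(P2)

print(sorted(right_quotient(C, L_P2)))  # candidate prefixes left for the first operand
print(sorted(left_quotient({"a"}, C)))  # what remains of C after reading a
```

For a pattern a · P2 these are the contexts C/L(P2) and L(a)\C that appear in the propositions below; intersecting the first one with L(a) leaves only the word a.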

Proposition 2. If P = ε · P₂, the following equalities hold:

1. T(1, P, C) = T(λ, ε, C/L(P₂))

2. T(2n, P, C) = T(n, P₂, C)

3. M(λ, P, C) = T(1, P, C) · {□} · T(2, P, C)

4. M(2n, P, C) = M(n, P₂, C) 

Proposition 3. If P = σ · P₂, the following equalities hold:

1. T(1, P, C) = T(λ, σ, C/L(P₂))

2. T(2n, P, C) = T(n, P₂, L(σ)\C)

3. M(λ, P, C) = T(1, P, C) · {□} · T(2, P, C)

4. M(2n, P, C) = M(n, P₂, L(σ)\C) 

For the case where P = P₁* · P₂ the situation is a bit more involved:

Proposition 4. When P = P₁* · P₂ the following equalities hold:

1. T(1, P, C) = T₁

2. T(2n, P, C) = T(n, P₂, C₂)

3. M(λ, P, C) = I

4. M(2n, P, C) = M(n, P₂, C₂)

Here, T₁ = {p ∈ L(P₁*) | ∃s ∈ L(P₂) : ps ∈ C ∧ c}, C₂ = {s ∈ L(P₂) | ∃p ∈
L(P₁*) : ps ∈ C ∧ c} and I = {p□s | ps ∈ C ∧ p ∈ L(P₁*) ∧ s ∈ L(P₂) ∧ c} with
c = ¬(∃w₃, w₄ : w₃ ≠ λ ∧ w₃w₄ = s ∧ pw₃ ∈ L(P₁*) ∧ w₄ ∈ L(P₂)). 
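
For a finite context the three sets of Proposition 4 can be computed by trying every split of every word. The sketch below is a brute-force illustration under assumptions: the function names are hypothetical, membership in L(P1*) and L(P2) is delegated to `re.fullmatch`, and the example pattern is the counterexample (a + a·b)* · (b + ε).

```python
import re

BOX = "□"

def in_lang(regex, w):
    return re.fullmatch(regex, w) is not None

def condition_c(p, s, star, rest):
    """c holds if no non-empty prefix w3 of s can be moved from s to p."""
    return not any(in_lang(star, p + s[:k]) and in_lang(rest, s[k:])
                   for k in range(1, len(s) + 1))

def prop4_sets(context, star, rest):
    """Return (T1, C2, I) for P = P1* · P2 and a finite context."""
    t1, c2, marked = set(), set(), set()
    for w in context:
        for k in range(len(w) + 1):
            p, s = w[:k], w[k:]
            if in_lang(star, p) and in_lang(rest, s) and condition_c(p, s, star, rest):
                t1.add(p)
                c2.add(s)
                marked.add(p + BOX + s)
    return t1, c2, marked

T1, C2, I = prop4_sets({"ab", "a", "abb"}, r"(?:a|ab)*", r"(?:b|)")
print(sorted(T1), sorted(C2), sorted(I))
```

The split a□b of the word ab is rejected by condition c, because the b can still be absorbed by the starred subpattern; only ab□ survives for that word.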

Of course, this proposition is of little use if we cannot calculate T₁, C₂ and I. The
next lemma gives one possible way of calculating them and also shows that they
are regular if C is. We denote Σ ∪ {□} by Σ□ and let π be the homomorphism
from Σ□* to Σ* with π(σ) = σ for every σ ∈ Σ and π(□) = λ. Clearly, if L is a
regular language, so is π⁻¹(L) = {w ∈ Σ□* | π(w) ∈ L}. 

Lemma 2. Using the notation of Proposition 4, and writing L(P₁*) as L₁, L(P₂)
as L₂ and π⁻¹(L₁) − (L₁ · {□}) as A, we have:

- I = π⁻¹(C) ∩ ((L₁ · {□} · L₂) − A · L₂),

- T₁ = I/({□} · L₂), and

- C₂ = (L₁ · {□})\I. 
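
The first equality of Lemma 2 can be sanity-checked on bounded-length words over Σ ∪ {□}. The following sketch is an assumption-laden illustration: it fixes Σ = {a, b}, the example pattern (a + a·b)* · (b + ε), a finite context, and enumerates all marked words up to a chosen length instead of building automata.

```python
import re
from itertools import product

BOX = "□"
STAR, REST = r"(?:a|ab)*", r"(?:b|)"    # L1 = L(P1*), L2 = L(P2)

def in_l1(w):
    return re.fullmatch(STAR, w) is not None

def in_l2(w):
    return re.fullmatch(REST, w) is not None

def pi(w):
    """The homomorphism π: erase every □."""
    return w.replace(BOX, "")

def in_a(v):
    """A = π⁻¹(L1) − (L1 · {□})."""
    return in_l1(pi(v)) and not (v.endswith(BOX) and in_l1(v[:-1]))

def in_l1_box_l2(w):
    parts = w.split(BOX)
    return len(parts) == 2 and in_l1(parts[0]) and in_l2(parts[1])

def in_a_l2(w):
    return any(in_a(w[:k]) and in_l2(w[k:]) for k in range(len(w) + 1))

def lemma2_i(context, max_len):
    """π⁻¹(C) ∩ ((L1 · {□} · L2) − A · L2), restricted to words of length <= max_len."""
    words = ("".join(t) for n in range(max_len + 1)
             for t in product("ab" + BOX, repeat=n))
    return {w for w in words
            if pi(w) in context and in_l1_box_l2(w) and not in_a_l2(w)}

I = lemma2_i({"ab", "a", "abb"}, 4)
T1 = {w.split(BOX)[0] for w in I}       # I / ({□} · L2)
C2 = {w.split(BOX)[1] for w in I}       # (L1 · {□}) \ I
print(sorted(I), sorted(T1), sorted(C2))
```

The word a□b is removed by the subtraction of A · L₂, since a□b itself lies in A and the empty word lies in L₂; this is the set-level counterpart of condition c.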



Proof. By definition, (L₁ · {□} · L₂) − A · L₂ equals

{w₁□w₂ | w₁ ∈ L₁ ∧ w₂ ∈ L₂ ∧ ¬(∃v₁, v₂ : v₁v₂ = w₁□w₂ ∧ v₁ ∈ A ∧ v₂ ∈ L₂)}

Or, more elaborately,

{w₁□w₂ | w₁ ∈ L₁ ∧ w₂ ∈ L₂ ∧ ¬(∃v₁, v₂ : v₁v₂ = w₁□w₂
∧ π(v₁) ∈ L₁ ∧ (∀p ∈ L₁ : v₁ ≠ p□) ∧ v₂ ∈ L₂)}

We show that this equals

{w₁□w₂ | w₁ ∈ L₁ ∧ w₂ ∈ L₂ ∧
¬(∃w₃, w₄ : w₃ ≠ λ ∧ w₂ = w₃w₄ ∧ π(w₁□w₃) ∈ L₁ ∧ w₄ ∈ L₂)}

We can see this as follows. Suppose w₁□w₂ is in the upper set and suppose
that there do exist w₃ and w₄ such that w₂ = w₃w₄, w₃ ≠ λ, π(w₁□w₃) ∈ L₁
and w₄ ∈ L₂. Then take v₁ = w₁□w₃ and v₂ = w₄ to see that w₁□w₂ cannot be
in the upper set, a contradiction. On the other hand, suppose w₁□w₂ is in the
lower set and suppose that there do exist v₁ and v₂ such that v₁v₂ = w₁□w₂,
π(v₁) ∈ L₁, ∀p ∈ L₁ : v₁ ≠ p□ and v₂ ∈ L₂. Since v₂ ∈ L₂ and L₂ is a language
over Σ, v₂ cannot contain the symbol □. Since v₁v₂ = w₁□w₂, v₂ must be a
suffix of w₂. Hence, we can divide w₂ in w₃ and w₄ such that v₁ = w₁□w₃ and
v₂ = w₄. Since v₁ ≠ p□ for any p ∈ L₁, w₃ must be different from λ. Moreover, we
immediately have π(w₁□w₃) = π(v₁) ∈ L₁ and w₄ = v₂ ∈ L₂, which gives us a
contradiction. 

As a consequence, π⁻¹(C) ∩ ((L₁ · {□} · L₂) − A · L₂) must equal

{w₁□w₂ | π(w₁□w₂) ∈ C ∧ w₁ ∈ L₁ ∧ w₂ ∈ L₂ ∧
¬(∃w₃, w₄ : w₃ ≠ λ ∧ w₃w₄ = w₂ ∧ π(w₁□w₃) ∈ L₁ ∧ w₄ ∈ L₂)}

Since w₁ and w₂ do not contain □, w₁w₂ = π(w₁□w₂) ∈ C. By the same
reasoning w₁w₃ = π(w₁□w₃) ∈ L₁. Hence, π⁻¹(C) ∩ ((L₁ · {□} · L₂) − A · L₂) = I,
as desired. With c as in Proposition 4 we obtain the other two desired equalities:

I/({□} · L₂) = {p | ∃s ∈ L₂ : p□s ∈ I} = {p | ∃s ∈ L₂ : ps ∈ C ∧ p ∈ L₁ ∧ c} = T₁
(L₁ · {□})\I = {s | ∃p ∈ L₁ : p□s ∈ I} = {s | ∃p ∈ L₁ : ps ∈ C ∧ s ∈ L₂ ∧ c} = C₂

□ 

The case P = (P₁ · P₂) · P₃ is handled as follows:

Proposition 5. If P = (P₁ · P₂) · P₃ and P′ = P₁ · (P₂ · P₃), the following equalities
hold:

1. T(11n, P, C) = T(1n, P′, C)

2. T(12n, P, C) = T(21n, P′, C)

3. T(2n, P, C) = T(22n, P′, C)

4. M(2n, P, C) = M(22n, P′, C)

5. M(11n, P, C) = M(1n, P′, C)

6. M(12n, P, C) = M(21n, P′, C)

7. M(λ, P, C) = {w₁w₂□w₃ | w₁□w₂□w₃ ∈ J}

8. M(1, P, C) = J/({□} · Σ*)

9. T(1, P, C) = π(M(1, P, C))

where J = {w₁□w₂□w₃ | w₁□w₂w₃ ∈ M(λ, P′, C), w₂□w₃ ∈ M(2, P′, C)}. 

We note that J is regular if M(λ, P′, C) and M(2, P′, C) are. Indeed, to recognize
a word in J we simply start the automaton for M(λ, P′, C). When we read
the first □ we also start the automaton for M(2, P′, C), running both automata
in parallel, and modify the transition relation of M(λ, P′, C) to allow an extra
□ to be read. We accept if both automata are in a final state. 

Finally, we treat (P₁ + P₂) · P₃:

Proposition 6. If P = (P₁ + P₂) · P₃ and P′ = P₁ · P₃ + P₂ · P₃ the following
equalities hold:

1. T(1, P, C) = T(11, P′, C) ∪ T(21, P′, C)

2. T(11, P, C) = T(11, P′, C)

3. T(12, P, C) = T(21, P′, C)

4. T(2, P, C) = T(12, P′, C) ∪ T(22, P′, C)

5. M(λ, P, C) = M(1, P′, C) ∪ M(2, P′, C)

6. M(11n, P, C) = M(11n, P′, C)

7. M(12n, P, C) = M(21n, P′, C)

8. M(2n, P, C) = M(12n, P′, C) ∪ M(22n, P′, C) 

The type inference algorithm announced in Theorem 2 now works as follows if
C is regular. For the base cases ε, σ and P₁*, Lemma 1 allows us to calculate every
type, which must be regular. For the other cases, the propositions above dictate
how to calculate the types by recursion, using only regular operations on regular
sets. The algorithm can be seen to use the well-founded ⊐-ordering on patterns
introduced in the proof of Theorem 1 in its recursion, from which its termination
follows. However, the observant reader will note that the last proposition relates
(P₁ + P₂) · P₃ with the ⊐-incomparable P₁ · P₃ + P₂ · P₃. Termination still follows if
we modify the algorithm to combine the results of Proposition 6 with the ones
of Proposition 1 to calculate the type of (P₁ + P₂) · P₃ by recursion on P₁ · P₃ and
P₂ · P₃. 

Related work [6, 9, 10] claimed sound and complete type inference algorithms
for the matching relation with rule CKleene′ described at the end of the previous
section. Using the same counterexample pattern P = (a + a · b)* · (b + ε) and
string ab, these algorithms must compute T(1, P, {ab}) = {a} and T(2, P, {ab}) =
{b}. In contrast, our algorithm correctly computes T(1, P, {ab}) = {ab} and
T(2, P, {ab}) = {λ} in accordance with the longest match policy. 

4 Unifying Type Inference and Matching 



The process of matching w by P (and computing the resulting associations) 
can naively be implemented by evaluating the matching relation of Fig. 2 in a 



syntax-directed manner. This approach, however, is inefficient. When we want
to match w by P₁* · P₂, we have to create subdivisions of w into w₁ and w₂
satisfying the premises of rule CKleene. Since there are |w| possible divisions,
and since checking the premises for a possible division requires us to match w₁
by P₁* and w₂ by P₂, every letter of w is scanned at least |w| times in the worst
case scenario, giving Ω(|w|²) time complexity. We will now show how the type
inference algorithm of the previous section can be used to compile a pattern P
into an NFA that will allow us to execute the matching process in O(|w|) time. 
As a bonus, the computed NFA contains the inferred type of every node in P. The 
NFA can be at least exponentially larger than P, but this is not abnormal; indeed, 
the same happens in ML when the input is only allowed to be investigated once 
[12,13]. 

A non-deterministic finite automaton (NFA) A is a tuple (Q_A, I_A, F_A, δ_A)
where Q_A is a set of states, I_A ⊆ Q_A is the set of initial states, F_A ⊆ Q_A is
the set of final states and δ_A ⊆ Q_A × (Σ□ ∪ {λ}) × Q_A is the transition relation. An
accepting run of A on w = σ₁ … σₙ is a sequence (q₀, k₀), …, (q_m, k_m) where
k₀ = 0, q₀ ∈ I_A, k_m = n, q_m ∈ F_A and for every i either δ_A(q_i, σ_{k_i+1}, q_{i+1})
with k_{i+1} = k_i + 1 or δ_A(q_i, λ, q_{i+1}) with k_{i+1} = k_i and q_i ≠ q_{i+1}. The language
of A is the set of words for which an accepting run exists and will be denoted
by L(A). An NFA is deterministic (a DFA) if I_A is a singleton set, for every q
and σ there exists exactly one q′ for which δ(q, σ, q′), and δ(q, λ, q′) iff q = q′.
If S ⊆ Q_A and τ = (q₁, k₁), …, (q_m, k_m) is a run of A on some word w, then
pos(τ, S) = {k_i | q_i ∈ S}. If pos(τ, S) ≠ ∅ then τ|S is the couple (i, j) for
which i = min(pos(τ, S)) and j = max(pos(τ, S)). It is (−1, −1) otherwise. The
subword of w = σ₁ … σₙ bounded by (i, j), denoted as w_(i,j), is σ_{i+1} ⋯ σ_j if
0 ≤ i ≤ j ≤ n and ⊥ otherwise. 

If A₁ and A₂ are NFAs, we write A₁ · A₂ for the automaton recognizing
L(A₁) · L(A₂) obtained by connecting the final states of A₁ to the initial states
of A₂ by λ-transitions. We write A₁ ∪ A₂ for the automaton recognizing L(A₁) ∪
L(A₂) obtained by taking the tuple-wise union of A₁ and A₂, and we write
A₁ ∩ A₂ for the automaton recognizing L(A₁) ∩ L(A₂), obtained by the product
construction. In particular, A₁ ∩ A₂ has Q_{A₁} × Q_{A₂} as the set of states. We
denote the minimal DFA recognizing {□} as A□ and let π(A) be the automaton
where we transform every □-transition of A into a λ-transition. 

Algorithm 1 computes (recursively) the hyperautomaton H(P, C) = (A, f)
for pattern P and context C. Here A is an NFA and f is a function relating
bindable nodes n of P to triples (Q_n, I_n, F_n), where Q_n, I_n and F_n are all subsets
of Q_A. We use (Q₁, I₁, F₁) × Q₂ to denote (Q₁ × Q₂, I₁ × Q₂, F₁ × Q₂) and
(Q₁, I₁, F₁) ∪ (Q₂, I₂, F₂) to denote (Q₁ ∪ Q₂, I₁ ∪ I₂, F₁ ∪ F₂). We now state: 

Theorem 3. The hyperautomaton (A, f) computed by Algorithm 1 has the following
properties, where f(n) = (Q_n, I_n, F_n):

1. L(Q_n, I_n, F_n, δ_A) = T(n, P, C)

2. The automaton A is unambiguous: for every w there is at most one accepting
run of A on w. 



Algorithm 1 Calculate the hyperautomaton H(P, C).

if P = ε, P = σ or P = P₁* then
    compute a DFA A recognizing L(P) ∩ C
    return (A, f) with f(λ) = (Q_A, I_A, F_A)
else if P = P₁ + P₂ then
    compute (A₁, f₁) = H(P₁, C) and (A₂, f₂) = H(P₂, C − L(P₁))
    let A = A₁ ∪ A₂
    return (A, f) with f(λ) = (Q_A, I_A, F_A), f(1n) = f₁(n) and f(2n) = f₂(n)
else if P = P₁ · P₂ with P₁ = ε or P₁ = σ then
    compute (A₁, f₁) = H(P₁, C/L(P₂)) and (A₂, f₂) = H(P₂, L(P₁)\C)
    let A = A₁ · A₂
    return (A, f) with f(λ) = (Q_A, I_A, F_A), f(1n) = f₁(n) and f(2n) = f₂(n)
else if P = P₁* · P₂ then
    compute DFAs A_I and A_{T₁} for I and T₁ as defined in Proposition 4
    compute (A₂, f₂) = H(P₂, C₂) with C₂ as in Proposition 4
    let A = π((A_{T₁} · A□ · A₂) ∩ A_I)
    return (A, f) where f(λ) = (Q_A, I_A, F_A), f(1) = (Q_{T₁}, I_{T₁}, F_{T₁}) × Q_{A_I}, and
        f(2n) = f₂(n) × Q_{A_I}
else if P = (P₁ · P₂) · P₃ then
    compute (A′, f′) = H(P₁ · (P₂ · P₃), C) and let f′(n) = (Q′_n, I′_n, F′_n)
    let Q₁ = Q′₁ ∪ Q′₂₁, I₁ = (Q′₁ ∪ Q′₂₁) ∩ I_{A′} and F₁ = {q ∈ F′₂₁ | ∃q′ ∈ I′₂₂ : there is
        a λ-labeled path from q to q′ and there is a path from q′ to some state in F_{A′}}
    return (A′, f) where f(λ) = f′(λ), f(2n) = f′(22n), f(11n) = f′(1n), f(12n) =
        f′(21n) and f(1) = (Q₁, I₁, F₁)
else if P = (P₁ + P₂) · P₃ then
    compute (A′, f′) = H(P₁ · P₃ + P₂ · P₃, C)
    return (A′, f) where f(λ) = f′(λ), f(11) = f′(11), f(12) = f′(21), f(1) =
        f′(11) ∪ f′(21) and f(2) = f′(12) ∪ f′(22)
end if 



3. For every w: w ∈ C and w ∈ P ⇝ V holds iff there exists an accepting run
τ of A on w and V(n) = w_(τ|Q_n) for every n ∈ bn(P). 

The proof is in Appendix B. Part (3) allows us to derive the associations efficiently
if we are given the accepting run τ on w = σ₁ … σ_l. The problem is that
we are not given this run, but need to compute it. How can we best determine
V(n) in that case? We first compute the sets of states S₀, …, S_l where S₀ = I_A
and S_{i+1} = δ_A(S_i, σ_{i+1}). Here δ_A(S, σ) is defined to be the set of states that
we can reach from a state in S through a path labeled with σ. If S_l ∩ F_A ≠ ∅
then w is matched by P. We can then compute S′₀, …, S′_l with S′_l = S_l ∩ F_A
and S′_{i−1} = δ_A⁻¹(S′_i, σ_i) ∩ S_{i−1} where δ_A⁻¹(S′, σ) = S iff δ_A(S, σ) = S′.
Intuitively, S′_i contains those states that can be reached from a start state with a
path spelling σ₁ … σ_i and from which we can reach a state in F_A through a
path spelling σ_{i+1} … σ_l. Note that we must be able to order the states of S′_i
into a sequence q₁, …, q_k such that δ(q_j, λ, q_{j+1}) or we can construct multiple
accepting runs, contradicting (2). In particular, if q^i_1, …, q^i_{k_i} is the ordering for
S′_i, then τ = (q^0_1, 0), …, (q^0_{k_0}, 0), (q^1_1, 1), …, (q^1_{k_1}, 1), …, (q^l_1, l), …, (q^l_{k_l}, l) must be
the accepting run. Clearly, τ|Q_n = (i, j) iff S′_i ∩ Q_n ≠ ∅, S′_j ∩ Q_n ≠ ∅ and there
is no m < i or m > j for which S′_m ∩ Q_n ≠ ∅. So, it suffices to search the first
and last i for which S′_i ∩ Q_n ≠ ∅ to obtain τ|Q_n, which gives V(n) by (3). Since
A is constant, all the calculations can be done in O(|w|) time, which gives us an
efficient matching implementation. 
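
The forward and backward passes can be sketched in Python as follows. This is an illustration under stated assumptions, not the output of Algorithm 1: the automaton for a* · b and its state sets per node are written by hand, λ-transitions are stored under the empty label, and, following the intuitive description rather than the literal formulas, the initial set and the last backward set are closed under λ-moves.

```python
def closure(states, delta):
    """States reachable from `states` using only λ-transitions (label '')."""
    seen, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for r in delta.get((q, ""), ()):
            if r not in seen:
                seen.add(r)
                todo.append(r)
    return seen

def step(states, sym, delta):
    """States reachable through a path labeled sym, with λ-moves around it."""
    moved = {r for q in closure(states, delta) for r in delta.get((q, sym), ())}
    return closure(moved, delta)

def associations(w, delta, initial, final, node_states):
    """Map each node to its matched subword, or return None if w is rejected."""
    fwd = [closure(initial, delta)]                    # S_0, ..., S_l
    for sym in w:
        fwd.append(step(fwd[-1], sym, delta))
    back = [set() for _ in fwd]                        # S'_0, ..., S'_l
    back[-1] = {q for q in fwd[-1] if closure({q}, delta) & final}
    if not back[-1]:
        return None
    for i in range(len(w), 0, -1):
        back[i - 1] = {q for q in fwd[i - 1]
                       if step({q}, w[i - 1], delta) & back[i]}
    result = {}
    for node, states in node_states.items():
        hits = [i for i, s in enumerate(back) if s & states]
        result[node] = w[hits[0]:hits[-1]] if hits else None
    return result

# Hand-built automaton for a* · b: state p loops on a, then a λ-move to r0, then b.
DELTA = {("p", "a"): {"p"}, ("p", ""): {"r0"}, ("r0", "b"): {"r1"}}
NODES = {"": {"p", "r0", "r1"}, "1": {"p"}, "2": {"r0", "r1"}}
print(associations("aab", DELTA, {"p"}, {"r1"}, NODES))
```

Each position is processed with work that depends only on the size of the automaton, so the two passes together stay linear in |w| for a fixed automaton.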

5 Future Work 

We note that Perl has two ways of disambiguating the Kleene closure in a
concatenation. One of them is the longest match strategy presented here, making the
*-operator "greedy". Another version of the *-operator, denoted *? in Perl, has
a shortest match semantics. That is, it will match the smallest prefix that still
allows the rest of the word to be matched. The matching relation and the type
inference algorithm can be expanded in a straightforward manner to include
this operator. 

Although there has been extensive research on the regular type inference 
problem, there has not yet been a formal investigation of the inherent time com- 
plexity bounds. While we conjecture that our algorithm executes in 2EXPTIME, 
a formal proof still has to be given. As we noted in the previous section, the
construction of H(P, C) can involve an exponential blowup. A tight upper bound
(single-exponential, double-exponential or more) still has to be found. 

The true power of regular expression pattern matching emerges when we in- 
troduce tree patterns matching unranked hedges, in the context of XML. We are 
currently trying to expand the matching relation and type inference algorithm 
to this end. It will be interesting to see if we can also unify type inference and 
pattern matching in this setting. 

Acknowledgments I thank Jan Van den Bussche, Dirk Leinders, Wim Martens 
and Frank Neven for inspiring discussions and for their comments on a draft 
version of this paper. 
A Proofs of claims in Sec. 3 



For completeness' sake we present the proofs of the various propositions in Sec. 3. 
A.1 Proof of Proposition 1 

Proof. If w ∈ C with w ∈ P ⇝ V then there are two possible matching
derivations. The first one has the form

w ∈ P₁ ⇝ V
────────────────────── Or1
w ∈ P₁ + P₂ ⇝ V + P₂

From which we immediately obtain w ∈ P₁ ⇝ V. Now T(1n, P, C) ⊆ T(n, P₁, C)
and M(1n, P, C) ⊆ M(n, P₁, C) since (V + P₂)(1n) = V(n) and since V(1n) = ⊥
in the other derivation. On the other hand, if w ∈ P₁ ⇝ V then we can create
the matching derivation above, which means T(n, P₁, C) ⊆ T(1n, P, C) and
M(n, P₁, C) ⊆ M(1n, P, C). 

The second matching derivation looks like

$$\frac{w \in P_2 \leadsto V \qquad w \notin P_1}{w \in P_1 + P_2 \leadsto P_1 + V}\ \text{Or2}$$

So, $w \in C - L(P_1)$ and $w \in P_2 \leadsto V$. Because
$(P_1 + V)(2n) = V(n)$, we have $T(2n, P, C) \subseteq T(n, P_2, C - L(P_1))$
and $M(2n, P, C) \subseteq M(n, P_2, C - L(P_1))$. On the other hand, if
$w \in C - L(P_1)$ and $w \in P_2 \leadsto V$, we can create the matching
derivation above to obtain $w \in P \leadsto P_1 + V$, so
$T(n, P_2, C - L(P_1)) \subseteq T(2n, P, C)$ and
$M(n, P_2, C - L(P_1)) \subseteq M(2n, P, C)$.
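As an illustration only, the case split used in this proof can be mimicked on a finite context. The sketch below assumes that patterns are written as Python `re` strings and that the context $C$ is a finite set of words; the names `match_alternation` and `branch_contexts` are hypothetical and do not come from the paper.

```python
import re


def match_alternation(w, p1, p2):
    """First-match policy for P1 + P2.

    The left branch is tried first (rule Or1); the right branch is used
    only when w is not in L(P1) (rule Or2). Returns a tagged word, or
    None when w belongs to neither language.
    """
    if re.fullmatch(p1, w) is not None:
        return ("left", w)
    if re.fullmatch(p2, w) is not None:
        return ("right", w)
    return None


def branch_contexts(context, p1, p2):
    """Words of a finite context that reach each branch.

    The left branch sees L(P1) ∩ C, the right branch sees
    L(P2) ∩ (C - L(P1)), mirroring the contexts C and C - L(P1).
    """
    left = {w for w in context if re.fullmatch(p1, w)}
    right = {w for w in context - left if re.fullmatch(p2, w)}
    return left, right
```

For instance, with `p1 = "a|ab"` and `p2 = "ab|b"`, the word `ab` is claimed by the left branch even though the right branch could also match it, which is exactly why the right branch is analysed under the reduced context.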

A.2 Proof of Proposition 3

Proof. Define $C_1 = C/L(P_2)$, $C_2 = L(\sigma) \backslash C$. If $w \in C$
with $w \in P \leadsto V$ then the top of the matching derivation must look like

$$\dfrac{\dfrac{\sigma \in \sigma \leadsto V_1 \qquad w_2 \in P_2 \leadsto V_2}{(\sigma, w_2) \in P \leadsto (V_1, V_2)}\ \text{CLab}}{w = \sigma w_2 \in P \leadsto V_1 \cdot V_2}\ \text{CElem}$$

Since $\sigma \in C_1$ and $\sigma \in \sigma \leadsto V_1$,
$T(1, P, C) \subseteq T(\lambda, \sigma, C_1)$. If
$w_1 \in T(\lambda, \sigma, C_1)$, then $w_1 \in L(\sigma)$ and there must exist
some $w_2 \in L(P_2)$ for which $w_1 w_2 \in C$. Then we can reconstruct the
matching derivation above using Theorem 1.1 and so equality (1) holds.

Since $w_2 \in C_2$ and $w_2 \in P_2 \leadsto V_2$,
$T(2n, P, C) \subseteq T(n, P_2, C_2)$ and $M(2n, P, C) \subseteq M(n, P_2, C_2)$
hold. On the other hand, if $w_2 \in C_2$ then by definition $\sigma w_2 \in C$.
If $w_2 \in P_2 \leadsto V_2$, then we can create the matching derivation above,
from which we may conclude that $T(n, P_2, C_2) \subseteq T(2n, P, C)$,
$M(n, P_2, C_2) \subseteq M(2n, P, C)$ and
$T(1, P, C) \cdot \{\square\} \cdot T(2, P, C) \subseteq M(\lambda, P, C)$.
The inclusion

$$M(\lambda, P, C) \subseteq T(1, P, C) \cdot \{\square\} \cdot T(2, P, C)$$

trivially holds.

Proposition 2 can be proven in a similar way. 
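The two contexts $C/L(P_2)$ and $L(\sigma) \backslash C$ used in this proof are quotients of $C$. For a finite context they can be computed directly, as in the following sketch. It assumes that $C$ is a finite Python set of strings, that $P_2$ is given as a Python `re` string, and that $\sigma$ is a single symbol; the function names are hypothetical.

```python
import re


def left_quotient_by_symbol(sigma, context):
    """L(sigma) \\ C: every v such that sigma followed by v lies in C."""
    return {w[len(sigma):] for w in context if w.startswith(sigma)}


def right_quotient(context, p2):
    """C / L(P2): every prefix u of a word u v in C with v in L(P2)."""
    return {
        w[:i]
        for w in context
        for i in range(len(w) + 1)
        if re.fullmatch(p2, w[i:]) is not None
    }
```

The first function yields the context under which the remainder $P_2$ is analysed, the second the context under which the leading symbol is analysed.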

A.3 Proof of Proposition 4

Proof. Let $w \in C$ and suppose $w \in P \leadsto V$, then the top of the
matching derivation must look like

$$\dfrac{\dfrac{w_1 \in P_1^* \leadsto V_1 \qquad w_2 \in P_2 \leadsto V_2 \qquad \neg(\exists w_3 \neq \lambda, w_4 : w_2 = w_3 w_4 \wedge w_1 w_3 \in P_1^* \wedge w_4 \in P_2)}{(w_1, w_2) \in P \leadsto (V_1, V_2)}\ \text{CKleene}}{w = w_1 w_2 \in P \leadsto V_1 \cdot V_2}\ \text{CElem}$$

The first equality then holds by this observation and application of Theorem
1.1. Since $w_2 \in C_2$, $T(2n, P, C) \subseteq T(n, P_2, C_2)$ and
$M(2n, P, C) \subseteq M(n, P_2, C_2)$. On the other hand, if $w_2 \in C_2$, we
know that $w_2 \in L(P_2)$, and there must exist some $w_1 \in L(P_1^*)$ with
$w_1 w_2 \in C$ for which condition c holds. By application of Theorem 1.1 we
can reconstruct the matching derivation shown above. As such
$T(n, P_2, C_2) \subseteq T(2n, P, C)$ and
$M(n, P_2, C_2) \subseteq M(2n, P, C)$.

If $w_1 \square w_2 \in M(\lambda, P, C)$ then, by definition, there exists some
$w \in C$ such that $w \in P \leadsto V$, $V(1) = w_1$ and $V(2) = w_2$. Then, by
the derivation above and Theorem 1.1 we know that $w = w_1 w_2$,
$w_1 \in L(P_1^*)$ and $w_2 \in L(P_2)$. Then $w_1 \square w_2 \in I$ by applying
the same theorem to the third premise of CKleene. On the other hand, if
$w_1 \square w_2 \in I$, we can use the theorem again to reconstruct the
matching derivation above, showing that $I \subseteq M(\lambda, P, C)$. □
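The third premise of CKleene states that no nonempty prefix of $w_2$ can be moved over to $w_1$ while both parts still match, so $w_1$ is the longest admissible prefix. A minimal sketch of this split is given below. It assumes that the left and right patterns are available as Python `re` strings; the name `longest_match_split` and its parameters are hypothetical.

```python
import re


def longest_match_split(w, left_pat, right_pat):
    """Split w into (w1, w2) with w1 in L(left_pat), w2 in L(right_pat).

    Candidate split points are scanned from the longest prefix down to
    the empty one, so the first hit is the split for which no longer
    prefix works. Returns None when w has no valid split at all.
    """
    for i in range(len(w), -1, -1):
        w1, w2 = w[:i], w[i:]
        if re.fullmatch(left_pat, w1) and re.fullmatch(right_pat, w2):
            return w1, w2
    return None
```

Scanning from the longest prefix makes the uniqueness of the pair $(w_1, w_2)$ immediate: any other valid split has a strictly shorter $w_1$ and is therefore ruled out by the third premise.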

A.4 Proof of Proposition 5

We will use the following lemma: 

Lemma 3. If $(w_1, w_2) \in P \leadsto (V_1, V_2)$ with $P = P_1 \cdot P_2$ then
$w_2 \in P_2 \leadsto V_2$.

Proof. The proof goes by induction on the matching derivation
$(w_1, w_2) \in P \leadsto (V_1, V_2)$ with a case analysis on the last rule
used. In all the cases, the result either follows immediately from the premise
of the last rule used, or follows immediately from the induction hypothesis. □



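Proof. If $w \in C$ with $w \in P \leadsto V$, then the matching derivation must
look like

$$\dfrac{\dfrac{(w_1, w_2 w_3) \in P_1 \cdot (P_2 \cdot P_3) \leadsto (V_1, W) \qquad (w_2, w_3) \in P_2 \cdot P_3 \leadsto (V_2, V_3)}{(w_1 w_2, w_3) \in (P_1 \cdot P_2) \cdot P_3 \leadsto (V_1 \cdot V_2, V_3)}\ \text{CCon}}{w = w_1 w_2 w_3 \in (P_1 \cdot P_2) \cdot P_3 \leadsto (V_1 \cdot V_2) \cdot V_3}\ \text{CElem}$$

Using CElem on the second premise of CCon, we obtain
$w_2 w_3 \in P_2 \cdot P_3 \leadsto V_2 \cdot V_3$. By Lemma 3,
$w_2 w_3 \in P_2 \cdot P_3 \leadsto W$, which means $W = V_2 \cdot V_3$ by
Theorem 1. By using CElem on the first premise of CCon,
$w \in P' \leadsto V_1 \cdot (V_2 \cdot V_3)$. If $w \in P' \leadsto V$, the
derivation must look like

$$\frac{(w_1, w_2 w_3) \in P_1 \cdot (P_2 \cdot P_3) \leadsto (V_1, W)}{w = w_1 w_2 w_3 \in P_1 \cdot (P_2 \cdot P_3) \leadsto V_1 \cdot W}\ \text{CElem}$$

We know by Lemma 3 that $w_2 w_3 \in P_2 \cdot P_3 \leadsto W$. By the premise of
CElem, that can only happen if
$(w_2, w_3) \in P_2 \cdot P_3 \leadsto (V_2, V_3)$ for some $V_2$ and $V_3$. But
then, we can use CCon to obtain $w \in P \leadsto (V_1 \cdot V_2) \cdot V_3$.

From these observations, equalities (1) until (6) follow immediately.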

Next, we will show that $J = \{w_1 \square w_2 \square w_3 \mid w_1 w_2 w_3 \in C,
w_1 w_2 w_3 \in P \leadsto V, V(1) = w_1 w_2, V(11) = w_1, V(12) = w_2,
V(2) = w_3\}$. Indeed, suppose $w_1 \square w_2 w_3 \in M(\lambda, P', C)$ and
$w_2 \square w_3 \in M(2, P', C)$. This means there must exist some $w' \in C$
with $w' \in P' \leadsto V$. As explained above,
$V = V_1 \cdot (V_2 \cdot V_3)$. Since $V(1) = w_1$ and $V(2) = w_2 w_3$,
$w' = w_1 w_2 w_3$. Since $w_2 \square w_3 \in M(2, P', C)$, there must exist
some $w'' \in C$ with $w'' \in P' \leadsto V'$, $V'(21) = w_2$ and
$V'(22) = w_3$. By the observations about the matching derivation of $P'$ made
above, $w'' = v w_2 w_3$ and $V' = V_1' \cdot (V_2' \cdot V_3')$. Then
$w_2 w_3 \in P_2 \cdot P_3 \leadsto V_2' \cdot V_3'$ and by Lemma 3
$w_2 w_3 \in P_2 \cdot P_3 \leadsto V_2 \cdot V_3$, so
$V_2 \cdot V_3 = V_2' \cdot V_3'$ by Theorem 1. Thus, $V(21) = w_2$ and
$V(22) = w_3$. As such, $J = \{w_1 \square w_2 \square w_3 \mid w_1 w_2 w_3 \in C,
w_1 w_2 w_3 \in P' \leadsto V, V(1) = w_1, V(21) = w_2, V(22) = w_3\}$. By the
observations made above, this equals $\{w_1 \square w_2 \square w_3 \mid
w_1 w_2 w_3 \in C, w_1 w_2 w_3 \in P \leadsto V, V(1) = w_1 w_2, V(11) = w_1,
V(12) = w_2, V(2) = w_3\}$.

The $\supseteq$ inclusion of equalities (7) and (8) then follows immediately.
Now suppose $w \in C$, $w \in P \leadsto V$ and $V(1) = w_1 w_2$, $V(2) = w_3$.
As we can see in the matching derivation above, $w = w_1 w_2 w_3$, so
$w_1 \square w_2 \square w_3 \in J$, from which we may conclude
$M(\lambda, P, C) \subseteq \{w_1 w_2 \square w_3 \mid
w_1 \square w_2 \square w_3 \in J\}$. If $w \in C$, $w \in P \leadsto V$ with
$V(11) = w_1$ and $V(12) = w_2$, then $V(1) = w_1 w_2$ and there must exist some
$w_3$ such that $w_1 w_2 w_3 \in C$ with $V(2) = w_3$ (see the derivation
above). So, $M(1, P, C) \subseteq J/(\{\square\} \cdot \Sigma^*)$. The last
equality follows directly from equality (8).

A.5 Proof of Proposition 6

Proof. If $w \in C$ and $w \in P \leadsto V$ then there are two possibilities
for the top of the matching derivation:

1. Rule COr1 is used:

$$\dfrac{\dfrac{(w_1, w_2) \in P_1 \cdot P_3 \leadsto (V_1, V_2)}{(w_1, w_2) \in (P_1 + P_2) \cdot P_3 \leadsto (V_1 + P_2, V_2)}\ \text{COr1}}{w = w_1 w_2 \in (P_1 + P_2) \cdot P_3 \leadsto (V_1 + P_2) \cdot V_2}\ \text{CElem}$$

So we can create the following matching derivation, proving
$w \in P' \leadsto (V_1 \cdot V_2) + (P_2 \cdot P_3)$:

$$\dfrac{\dfrac{(w_1, w_2) \in P_1 \cdot P_3 \leadsto (V_1, V_2)}{w_1 \cdot w_2 \in P_1 \cdot P_3 \leadsto V_1 \cdot V_2}\ \text{CElem}}{w = w_1 \cdot w_2 \in (P_1 \cdot P_3) + (P_2 \cdot P_3) \leadsto (V_1 \cdot V_2) + (P_2 \cdot P_3)}\ \text{Or1}$$

On the other hand, suppose $w \in P' \leadsto V$ for some $V$, then the matching
derivation could be of the same form as above. If it is, we can easily construct
the first matching derivation given here, so
$w \in P \leadsto (V_1 + P_2) \cdot V_2$.
2. Rule COr2 is used:

$$\dfrac{\dfrac{(w_1, w_2) \in P_2 \cdot P_3 \leadsto (V_1, V_2) \qquad w_1 w_2 \notin P_1 \cdot P_3}{(w_1, w_2) \in (P_1 + P_2) \cdot P_3 \leadsto (P_1 + V_1, V_2)}\ \text{COr2}}{w = w_1 w_2 \in (P_1 + P_2) \cdot P_3 \leadsto (P_1 + V_1) \cdot V_2}\ \text{CElem}$$

Then we can make the following matching derivation to show that
$w \in P' \leadsto (P_1 \cdot P_3) + (V_1 \cdot V_2)$:

$$\dfrac{\dfrac{(w_1, w_2) \in P_2 \cdot P_3 \leadsto (V_1, V_2)}{w_1 \cdot w_2 \in P_2 \cdot P_3 \leadsto V_1 \cdot V_2}\ \text{CElem} \qquad w_1 w_2 \notin P_1 \cdot P_3}{w = w_1 \cdot w_2 \in (P_1 \cdot P_3) + (P_2 \cdot P_3) \leadsto (P_1 \cdot P_3) + (V_1 \cdot V_2)}\ \text{Or2}$$

On the other hand, if $w \in P' \leadsto V$ and we do not use Or1 at the top
(which is suggested in the previous case), the matching derivation must look
like the one given above. But then we can easily create the matching derivation
at the beginning of this case to show that
$w \in P \leadsto (P_1 + V_1) \cdot V_2$.

All equalities follow from these observations. 
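The proofs of Propositions 5 and 6 relate a pattern $P$ to a rearranged pattern $P'$: $(P_1 \cdot P_2) \cdot P_3$ to $P_1 \cdot (P_2 \cdot P_3)$, and $(P_1 + P_2) \cdot P_3$ to $P_1 \cdot P_3 + P_2 \cdot P_3$. The following sketch shows the two rearrangements on a small pattern syntax tree. The classes `Sym`, `Alt` and `Cat` and the function `rewrite_top` are hypothetical names introduced only for this illustration; the sketch covers the shape of $P'$, not the accompanying translation of matching values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class Alt:          # P1 + P2
    left: object
    right: object


@dataclass(frozen=True)
class Cat:          # P1 . P2
    left: object
    right: object


def rewrite_top(p):
    """Apply one rearrangement at the root of p, if one is applicable."""
    if isinstance(p, Cat) and isinstance(p.left, Cat):
        # (P1 . P2) . P3  becomes  P1 . (P2 . P3)
        return Cat(p.left.left, Cat(p.left.right, p.right))
    if isinstance(p, Cat) and isinstance(p.left, Alt):
        # (P1 + P2) . P3  becomes  P1 . P3 + P2 . P3
        return Alt(Cat(p.left.left, p.right), Cat(p.left.right, p.right))
    return p
```

After either step the left factor of every concatenation at the root is structurally smaller, which is the kind of decrease an inductive argument over patterns can rely on.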



B Proof of Theorem 3

We will prove the following, stronger theorem: 

Theorem 4. The hyperautomaton $(A, f)$ computed by Algorithm 1 has the
following properties, where $f(n) = (Q_n, I_n, F_n)$:

1. $L(Q_n, I_n, F_n, \delta_A) = T(n, P, C)$.

2. The automaton $A$ is unambiguous: for every $w$ there is at most one
accepting run of $A$ on $w$.

3. For every $w$: $w \in C$ and $w \in P \leadsto V$ holds iff there exists an
accepting run $\tau$ of $A$ on $w$ and $V(n) = w_{\tau|Q_n}$ for every
$n \in bn(P)$.

4. If $n \in bn(P)$, $P(n) = \cdot$, $\tau$ is an accepting run of $A$,
$\tau|_{Q_{n1}} = (i_1, j_1)$ and $\tau|_{Q_{n2}} = (i_2, j_2)$ (both different
from $(-1, -1)$), then $j_1 = i_2$ and we can find $q \in F_{n1}$ and
$q' \in I_{n2}$ such that $(q, j_1), \ldots, (q', i_2)$ appears in $\tau$.

Proof. The proof goes by well-founded induction on {V, where the □ ordering 
on patterns was introduced in the proof of Theorem 1. 

Base case We treat $P = \varepsilon$, $\sigma$ or $P_1^*$ together. Property (1)
follows directly from the construction and Lemma 1. Property (2) holds trivially
since the computed automaton is a DFA. Property (3) holds since $\lambda$ is the
only bindable node. Since there is no bindable node labeled with a
concatenation, (4) trivially holds.

$P = P_1 + P_2$ If $P = P_1 + P_2$, let $(A_1, f_1) = H(P_1, C)$ and
$(A_2, f_2) = H(P_2, C - L(P_1))$. For these hyperautomata, the theorem holds by
induction. Property (1) follows directly from the induction hypothesis and
Proposition 1 if $n \neq \lambda$. To show that the property holds for
$n = \lambda$, we use Lemma 1 and the induction hypothesis:
$T(\lambda, P_1, C) \cup T(\lambda, P_2, C - L(P_1)) = (L(P_1) \cap C) \cup
(L(P_2) \cap (C - L(P_1))) = (L(P_1) \cup L(P_2)) \cap C = T(\lambda, P, C)$.
Suppose that $A$ is ambiguous, i.e. that there are two accepting runs of $A$ on
$w$. Then one has to be an accepting run of $A_1$ on $w$ and the other an
accepting run of $A_2$ on $w$ (otherwise, $A_1$ or $A_2$ would be ambiguous,
which is impossible by the induction hypothesis). But then,
$w \in T(\lambda, P_1, C) = L(P_1) \cap C$ and
$w \in T(\lambda, P_2, C - L(P_1)) = L(P_2) \cap (C - L(P_1))$, which means
$w \in L(P_1)$ and $w \notin L(P_1)$, contradiction. In the proof of Proposition
1 we remarked that $w \in P \leadsto V$ iff either $w \in P_1 \leadsto V_1$ with
$V = V_1 + P_2$ or $w \in P_2 \leadsto V_2$ with $V = P_1 + V_2$ and
$w \notin P_1$. Property (3) then follows directly from the induction hypothesis
and the fact that the accepting run must either be an accepting run of $A_1$ or
of $A_2$. Property (4) follows directly from the induction hypothesis.

$P = \sigma \cdot P_2$ Let $(A_1, f_1) = H(\sigma, C/L(P_2))$ and
$(A_2, f_2) = H(P_2, L(\sigma) \backslash C)$. Property (1) follows from the
induction hypothesis and Proposition 3 if $n \neq \lambda$. It also holds for
$n = \lambda$ since $T(\lambda, P, C) = \pi(M(\lambda, P, C))$ and this equals
$T(\lambda, \sigma, C/L(P_2)) \cdot T(\lambda, P_2, L(\sigma) \backslash C)$ by
Proposition 3.

Note that we can split any accepting run $(q_0, k_0), \ldots, (q_m, k_m)$ of $A$
on $\sigma w$ into an accepting run $(q_0, k_0), \ldots, (q_l, k_l)$ of $A_1$ on
$\sigma$ and $(q_{l+1}, k_{l+1}), \ldots, (q_m, k_m)$ of $A_2$ on $w$. If there
are multiple such runs of $A$, we would get different runs of $A_1$ or $A_2$,
which is impossible since they are unambiguous, so Property (2) holds.

From the proof of Proposition 3 we know that $\sigma w \in P \leadsto V$ iff
$\sigma \in \sigma \leadsto V_1$ and $w \in P_2 \leadsto V_2$ with
$V = V_1 \cdot V_2$. Suppose $\sigma w \in C$ and $\sigma w \in P \leadsto V$. By
the induction hypothesis, we have an accepting run $\tau_1$ of $A_1$ on $\sigma$
with $V_1(\lambda) = \sigma_{\tau_1|Q_1}$ and an accepting run $\tau_2$ of $A_2$
on $w$ with $V_2(n) = w_{\tau_2|Q_{2n}}$. Since $\tau_1$ cannot contain states of
$A_2$ and $\tau_2$ cannot contain states of $A_1$, and $\tau_1, \tau_2$ is the
accepting run of $A$, the "if" part of (3) follows. On the other hand, we know
that we can split any accepting run $\tau$ of $A$ into an accepting run $\tau_1$
of $A_1$ (where no states of $A_2$ can occur) and an accepting run $\tau_2$ of
$A_2$ (where no states of $A_1$ can occur). By the induction hypothesis,
$\sigma \in \sigma \leadsto V_1$ with $V_1(\lambda) = \sigma$ and
$w \in P_2 \leadsto V_2$ with
$V_2(n) = w_{\tau_2|Q_{2n}} = (\sigma w)_{\tau|Q_{2n}}$, from which the "only if"
part follows. Property (4) is clear for $n = \lambda$ from the remarks made
above about an accepting run of $A$ and it follows from the induction hypothesis
otherwise.

$P = \varepsilon \cdot P_2$ We note that $L(\varepsilon) \backslash C = C$. The
proof is then analogous to the previous case.

$P = P_1^* \cdot P_2$ Let $A_I$ be a deterministic automaton for $I$, $A_{T_1}$
be a deterministic automaton for $T_1$, $(A_2, f_2) = H(P_2, C_2)$ and
$A = \pi((A_{T_1} \cdot A_\square \cdot A_2) \cap A_I)$. For $n \neq \lambda$,
Property (1) follows from Proposition 4 and the fact that we do not lose
information about the subautomata by taking the intersection and performing
$\pi$. Moreover, $I \subseteq T(1, P, C) \cdot \{\square\} \cdot T(2, P, C)$, so
$L((A_{T_1} \cdot A_\square \cdot A_2) \cap A_I) = I$. The result then follows
for $n = \lambda$ since $T(\lambda, P, C) = \pi(M(\lambda, P, C)) = \pi(I)$.

To prove unambiguity, we first note the following. Let B be the automaton 
obtained by computing {At^ ■ A\j ■ A2) D Aj. We assume that the states of B are 
of the form (g, s) with q G QATi-Afj-A2 and s € Qaj- If w € L{A), then there 
must exist some w' e L{B) with 7r(w) = w' . Thus, there are wi G L{Ati) and 
W2 € ^(^2) such that w = W1W2 and w' = wi^W2 G /• By definition of /, Wi 
and W2 are unique. Let r = ((go, so); ^o), • • • j ((5m, Sm), ^m) be an accepting run 
of A on w. Since A = 7r(-B) and since A^ is minimal, we can find exactly one i 
for which qi e /aq, ^i+i € and fcj = fcj+i = |u;i|. Necessarily, G i^Vj 
and qi+i G /^a • 

Suppose we have two accepting runs ti = ((<Zo, sq), fco ),..., (<7m, Sm), fcm) 
and r2 = ((^q, Sq), fcg), . . . , ((g^, sj), /C;') of A on w = cri...(T„. Then we can 
find ii and 12 for ti respectively T2 as described above. By definition of /, 
ki^ = kl^. Since both (go, ^o), • • • , (fti-i, fcii-i) and (qq, fcg), . . . , (g-^.^, fc^^-i) 
are accepting runs of At-^ on (Ti . . . crfc^^ , and since At^ and A/ are deterministic, 
((go, So), ko), ((gii_i, Sii_i), ki^-i) = ((go, Sq), fco), . . . , ((g-2_i, s^^-i), fc^^-i). 
Since the automaton is deterministic, ((g^^ , s^J, fc^J, ((g^^+i, s^j+i), fci^+i) = 
((9i2' ^U' '^U' ((9i2+i' '^L+i)' ^»2+i)- ^i^^'^ ^2 is unambiguous by induction, the 
accepting runs (gii+2, ^,1+2), . . . , (g™. A:™) and (g-,+2> ^42+2)^ • • • > (S;', ^(0 of ^2 on 
W2 must be equal. So, ri = T2, from which Property (2) follows. 

Suppose f2{n) = (Q^, I^, i^'J. By the remarks about the matching deriva- 
tions of P^ • P2 made in the proof of Proposition 4, we know that w G P iff 
w = W1W2 with 1010102 G I , Wi € Pi Vi with Vi(A) = wi and ■u;2 G P2 V2 
where V = Vi ■ ¥2- Since wiDw2 G /, W2 G C2. Suppose w & C and w V 
By the induction hypothesis of Property (3), there exists a run T2 of A2 on W2 
with ¥2(71) = W2t2\q, ■ definition of / and Ti, there exists a run n of At^ 
on wi with wi^^ig^ = wi. Then we can find g G Ia^ and g' G i^^n such that 



r' = Ti, {q, \wi\), {q', \wi\ + 1),T2 is an accepting run of Ai ■ ■ A^ on ■wiC\W2 
(here, T2 is obtained from T2 by adding \wi \ + 1 to the indexes). Since we know 
that wi'C\W2 e /, there is a run r" of Ai on it. By combining the r' and t", we 
can get an accepting run r of A on wiW2- It is easy to see that Wt\q^ = wi and 
Wt\q, = W2t2\q, ' from which the "if" part of Property (3) follows. The "only if" 
part follows by reasoning in the reverse direction. Indeed, if w E A, w £ 7r(/), so 
w G C and we explained earlier that a run r of A on w gives us an accepting run 
of Ati on wi and an accepting run of A2 on W2 with w = W1W2 and w = win«;2 
in I. Since the states of At^ and A2 are disjunct, the induction hypothesis can 
be applied, immediately giving the desired result. 

Property (4) immediately follows from the induction hypothesis if n 7^ A, 
and from the remarks made about the accepting run of A otherwise. 

P = {P,.P2)-Ps Let P' = Pr(P2-P3), {A', /') = H{T>', C) and f{n) = {Q'^, I'„, F'^). 
By construction, Ii = {Q[ U Q21) ^ ^A' and Fi contains those nodes of F21 from 
which there is a g G F22 such that 6A'{q, A, q') and such that there is a path from 
q' to a state in F^' ■ 

We will first prove Property (3). Let w <= C and w G P V. In the proof 
of Proposition 5 we noted that this holds iff w € P' F' with V'{X) = V{X), 
y'(l)y'(21) = y'(ln) = y(lln), V'{21n) = V{12n) and V'{22) = V{2). 

Furthermore, for n G {A, 1, 2, 21, 22}, V'{n) j^L. By the induction hypothesis, 
we have an accepting run of t of A' on w' with V'{n) = Wt\q, ■ The "if" part 
then follows by construction since V{1) = Wr\Q^ = t\q['''\q'2^ ~ V^'(l)y(21) . 
The "only if" part can be proven by similar arguments, reasoning in reverse. 

Property (1) follows from the induction hypothesis and Proposition 5 if n 1. 
Let B = {Qi,Ii,Fi,6a') and suppose w G T(1,P,C), i.e. there exists some 
w' € C with G P y and V{1) = w. We know that there exist Wi, W2 and 
ws such that w = Wi'W2, ^(H) = wi, ^(12) = W2 and V{2) = w^. Equally, 

G P' V with V'{1) = wi, V'{21) = W2 and V'(22) = ^3. Then there 
is a run r of A' on w' with w', = wi, w', = W2 and w', = W3. Let 

t|q^^ = and t\qi_^_^ = {12, ji) By the induction hypothesis of (4) on n = 2, 

ji = h and we can find qi G F^i and 52 G I22 such that (gi, Ji), . • . , (52; ii) occurs 
in r. Necessarily, gi G .Fi, so w G L{B). On the other hand, suppose w G L{B). 
Let Ti = (<7o, fco): • • ■ J {ill ki) be an accepting run of B on w. Then there is some 
G I2 such that <5a'(9;. A, gj+i) and such that there is a path from to a 
state in F^'- Let W3 be the word spelled on this path. We can then construct 
T2 = (qi+i, ki+i), • • • , {qm, km) such that Ti, T2 forms the accepting run of A' on 
wws. Obviously, min(pos(T, Qi)) = 0, max(pos(T, Q2)) = l^wal, ki G pos(T, Qi)) 
and ki+i G pos(r, Q2)- Is it possible that max(pos(T, Qi)) > \w\ = ki or that 
min(pos(T, Q2)) < l^j = Suppose it is, then we would have ww^ G P ^ 1^ 

with 1^(1) = iTi . . . aj^ and V{2) = aj^ . . .dn with n = \ww3\ and either ji > |w| 
or j2 < \w\. Now, necessarily, wws = V{X) which, as we noted in the proof of 
Proposition 5, should equal V{l)-V{2). However, the length of cti . . . cFj^cFj.^ ■ ■ - cin 
is always greater than n, contradiction. As such G C and ww^ G P F 
with V{1) = w, hence w G T(1,P,C). 



Property (4) also follows from these observations while Property (2) follows 
directly from the induction hypothesis. 



$P = (P_1 + P_2) \cdot P_3$ Let $P' = P_1 \cdot P_3 + P_2 \cdot P_3$,
$(A', f') = H(P', C)$ and $f'(n) = (Q'_n, I'_n, F'_n)$. Then,
$A' = A_1 \cup A_2$ where $(A_1, f_1) = H(P_1 \cdot P_3, C)$,
$(A_2, f_2) = H(P_2 \cdot P_3, C - L(P_1 \cdot P_3))$ and $f'(\lambda) = A$,
$f'(1n) = f_1(n)$ and $f'(2n) = f_2(n)$. Property (1) follows from the induction
hypothesis on $P_1 \cdot P_3$ and $P_2 \cdot P_3$, and the combination of
Propositions 6 and 1 if $n \neq \lambda$. For $n = \lambda$, we showed in the
proof of Proposition 6 that $w \in P \leadsto V$ iff $w \in P' \leadsto V'$. The
property then follows from the induction hypothesis, since
$V(\lambda) = V'(\lambda) = w$. Property (2) follows directly from the induction
hypothesis and the fact that a word cannot be in $L(P_1 \cdot P_3) \cap C$ and
$L(P_2 \cdot P_3) \cap (C - L(P_1 \cdot P_3))$ at the same time. Property (3)
follows from the induction hypothesis, and the relation between $V$ and $V'$
when $w \in P \leadsto V$ and $w \in P' \leadsto V'$ that we gave in the proof of
Proposition 6. In the same proof we (indirectly) showed that when
$w \in P' \leadsto V'$ holds, either $V'(11)$ and $V'(12)$ are different from
$\bot$ or $V'(21)$ and $V'(22)$ are, but not both. Property (4) follows from
this observation and the induction hypothesis.



